Xorbits is a powerful tool for exploring and analyzing large datasets. One of the classic datasets for demonstrating its capabilities is the NYC taxi dataset, which contains records of taxi rides in the city from 2009 to 2022. In this blog, we will use Xorbits to perform an initial exploration of the NYC taxi dataset and get a sense of what kind of insights we might be able to gain from the data.
# Install dependencies
%pip install xorbits==0.1.0 plotly==5.11.0
Download the taxi zone lookup table and the taxi zone geojson:
%%bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv
wget https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc\?method\=export\&format\=GeoJSON -O taxi_zones.geojson
You can start with yellow taxi trip records of a month:
%%bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-12.parquet
You can also use records of a year:
%%bash
for i in {1..12}
do
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-$(printf "%02d" $i).parquet
done
The first step is to initialize Xorbits:
import xorbits
# Initialize Xorbits in the local environment.
xorbits.init()
The second step is to load the data into an Xorbits DataFrame. This can be done using the
read_parquet() function, which allows us to specify the location of the parquet file and
any additional options we want to use while reading the data.
In the case of the NYC taxi dataset, here is an example of how we could do this using Xorbits:
import datetime
import json
import xorbits.pandas as pd
trips = pd.read_parquet("yellow_tripdata_2021-*.parquet")
# Remove outliers: keep only trips picked up in 2021
trips = trips[
    (trips['tpep_pickup_datetime'] >= datetime.datetime(2021, 1, 1))
    & (trips['tpep_pickup_datetime'] <= datetime.datetime(2021, 12, 31))
]
taxi_zones = pd.read_csv('taxi+_zone_lookup.csv')
with open('taxi_zones.geojson') as fd:
geojson = json.load(fd)
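Because xorbits.pandas mirrors the pandas API, the boolean-mask filter above behaves exactly as it does in plain pandas. A minimal sketch of the same datetime filtering on a tiny made-up frame:

```python
import datetime

import pandas as pd  # xorbits.pandas mirrors this API

# A tiny frame standing in for the trip data (values are made up).
df = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2020-12-31 23:59:00", "2021-06-15 08:30:00", "2022-01-01 00:01:00"]
    ),
    "fare_amount": [10.0, 12.5, 9.0],
})

# Combine two comparisons with & to keep only pickups inside the
# target year, just like the trip-cleaning step above.
mask = (df["tpep_pickup_datetime"] >= datetime.datetime(2021, 1, 1)) & (
    df["tpep_pickup_datetime"] <= datetime.datetime(2021, 12, 31)
)
print(df[mask])  # only the 2021-06-15 row survives
```

Note that each comparison must be parenthesized, since `&` binds more tightly than `>=` in Python.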
Once we have the data loaded into a DataFrame, we might want to get a sense of the overall structure of the data by looking at the number of rows and columns, the data types of each column, and the first few rows of the data. We can do this using the shape and dtypes attributes and the head() method, respectively:
print(trips.shape)
print(trips.dtypes)
print(trips.head())
(nan, 19)
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
airport_fee                     float64
dtype: object
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2021-12-01 00:19:51   2021-12-01 00:37:01              1.0
1         2  2021-12-01 00:29:07   2021-12-01 00:45:13              2.0
2         1  2021-12-01 00:12:40   2021-12-01 00:27:17              0.0
3         1  2021-12-01 00:10:18   2021-12-01 00:19:20              1.0
4         1  2021-12-01 00:25:12   2021-12-01 00:39:07              1.0

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           9.30         1.0                  N           138           141
1           2.76         1.0                  N           238            42
2           3.40         1.0                  N           239            74
3           1.30         1.0                  N           148            87
4           3.10         1.0                  N           231           246

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0             1         26.5   4.25      0.5        7.60          6.55
1             2         13.0   0.50      0.5        0.00          0.00
2             1         13.5   3.00      0.5        2.00          0.00
3             1          6.5   3.00      0.5        2.05          0.00
4             1         12.5   3.00      0.5        2.50          0.00

   improvement_surcharge  total_amount  congestion_surcharge  airport_fee
0                    0.3         45.70                   2.5         1.25
1                    0.3         16.80                   2.5         0.00
2                    0.3         19.30                   2.5         0.00
3                    0.3         12.35                   2.5         0.00
4                    0.3         18.80                   2.5         0.00
We can also use the describe() method to get a summary of the statistical properties of
each numerical column in the dataset.
print(trips.describe())
VendorID passenger_count trip_distance RatecodeID \
count 3.143916e+06 3.043266e+06 3.143916e+06 3.043266e+06
mean 1.701623e+00 1.440766e+00 7.190938e+00 1.176653e+00
std 4.912976e-01 1.016665e+00 7.283667e+02 3.266358e+00
min 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
25% 1.000000e+00 1.000000e+00 1.100000e+00 1.000000e+00
50% 2.000000e+00 1.000000e+00 1.840000e+00 1.000000e+00
75% 2.000000e+00 2.000000e+00 3.430000e+00 1.000000e+00
max 6.000000e+00 9.000000e+00 3.350937e+05 9.900000e+01
PULocationID DOLocationID payment_type fare_amount extra \
count 3.143916e+06 3.143916e+06 3.143916e+06 3.143916e+06 3.143916e+06
mean 1.647513e+02 1.625591e+02 1.213615e+00 1.429232e+01 1.058308e+00
std 6.571399e+01 7.032944e+01 5.166679e-01 1.361084e+01 1.271117e+00
min 1.000000e+00 1.000000e+00 0.000000e+00 -6.060000e+02 -4.500000e+00
25% 1.320000e+02 1.130000e+02 1.000000e+00 7.000000e+00 0.000000e+00
50% 1.620000e+02 1.620000e+02 1.000000e+00 1.000000e+01 5.000000e-01
75% 2.340000e+02 2.340000e+02 1.000000e+00 1.600000e+01 2.500000e+00
max 2.650000e+02 2.650000e+02 5.000000e+00 3.009000e+03 3.050000e+01
mta_tax tip_amount tolls_amount improvement_surcharge \
count 3.143916e+06 3.143916e+06 3.143916e+06 3.143916e+06
mean 4.903896e-01 2.596015e+00 5.148399e-01 2.968758e-01
std 8.476900e-02 3.128478e+00 2.021245e+00 4.275566e-02
min -5.500000e-01 -9.800000e+01 -5.100000e+01 -3.000000e-01
25% 5.000000e-01 0.000000e+00 0.000000e+00 3.000000e-01
50% 5.000000e-01 2.060000e+00 0.000000e+00 3.000000e-01
75% 5.000000e-01 3.260000e+00 0.000000e+00 3.000000e-01
max 3.300000e+00 4.140000e+02 8.000700e+02 3.000000e-01
total_amount congestion_surcharge airport_fee
count 3.143916e+06 3.043266e+06 3.043266e+06
mean 2.090387e+01 2.301929e+00 8.583418e-02
std 1.689414e+01 7.132180e-01 3.183060e-01
min -6.068000e+02 -2.500000e+00 -1.250000e+00
25% 1.184000e+01 2.500000e+00 0.000000e+00
50% 1.580000e+01 2.500000e+00 0.000000e+00
75% 2.230000e+01 2.500000e+00 0.000000e+00
max 3.012300e+03 2.500000e+00 1.250000e+00
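The summary already hints at data-quality issues: the minimum of fare_amount, tip_amount, and tolls_amount is negative (likely refunds or recording errors). If we wanted to drop such rows, a pandas-style boolean filter would do. A sketch on made-up values (not the real dataset):

```python
import pandas as pd

# Toy fares including a few negative outliers, as describe() revealed above.
fares = pd.DataFrame({"fare_amount": [26.5, -606.0, 13.0, -4.5, 6.5]})

# describe()/min() expose the negative minimum...
print(fares["fare_amount"].min())   # -606.0

# ...and a boolean mask drops the offending rows.
clean = fares[fares["fare_amount"] >= 0]
print(len(clean))                   # 3
```

The same expression works unchanged on an xorbits.pandas DataFrame, since it follows the pandas API.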
One way to analyze the NYC taxi dataset is to look at how the number of rides varies over time.
We can do this by creating a new column in the DataFrame that represents the pick-up date of
each ride, and then use the groupby method to group the data by month or year and compute the
count of rides for each group:
trips['PU_date'] = trips['tpep_pickup_datetime'].dt.date
count = trips.groupby('PU_date', as_index=False).agg(count=('VendorID', 'count'))
print(count)
       PU_date   count
0   2021-11-30      54
1   2021-12-01  121986
2   2021-12-02  128647
3   2021-12-03  136231
4   2021-12-04  132392
5   2021-12-05  109654
6   2021-12-06  112665
7   2021-12-07  126649
8   2021-12-08  131662
9   2021-12-09  134724
10  2021-12-10  133497
11  2021-12-11  130247
12  2021-12-12  107167
13  2021-12-13  111299
14  2021-12-14  120218
15  2021-12-15  125048
16  2021-12-16  126270
17  2021-12-17  124147
18  2021-12-18  110279
19  2021-12-19   91200
20  2021-12-20   94213
21  2021-12-21   93093
22  2021-12-22   91699
23  2021-12-23   87850
24  2021-12-24   69619
25  2021-12-25   40807
26  2021-12-26   54705
27  2021-12-27   69897
28  2021-12-28   75037
29  2021-12-29   74989
30  2021-12-30   77968
31  2021-12-31       3
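The agg(count=('VendorID', 'count')) call uses pandas-style named aggregation: the keyword becomes the output column name, and the tuple names the source column and the aggregation function. A small pandas sketch of the same pattern (toy dates, not the real counts):

```python
import pandas as pd

# Three toy trips across two pick-up dates.
trips = pd.DataFrame({
    "PU_date": ["2021-12-01", "2021-12-01", "2021-12-02"],
    "VendorID": [1, 2, 1],
})

# Named aggregation: the output column is called 'count', computed by
# applying the 'count' function to the 'VendorID' column of each group.
count = trips.groupby("PU_date", as_index=False).agg(count=("VendorID", "count"))
print(count)
#       PU_date  count
# 0  2021-12-01      2
# 1  2021-12-02      1
```

Passing as_index=False keeps PU_date as an ordinary column, which is convenient when the result feeds directly into a plotting library.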
We can then use a library like plotly to visualize the time series data:
import plotly.express as px
b = px.bar(count.to_pandas(), x='PU_date', y='count')
b.show()
Another way to analyze the NYC taxi dataset is to look at patterns in the spatial distribution of rides. Taking Manhattan as an example, we first filter the DataFrame by pick-up location ID:
manhattan_zones = taxi_zones[taxi_zones['Borough'] == 'Manhattan']['LocationID']
manhattan_trips = trips[trips['PULocationID'].isin(manhattan_zones)]
print(manhattan_trips)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \
1 2 2021-12-01 00:29:07 2021-12-01 00:45:13 2.0
2 1 2021-12-01 00:12:40 2021-12-01 00:27:17 0.0
3 1 2021-12-01 00:10:18 2021-12-01 00:19:20 1.0
4 1 2021-12-01 00:25:12 2021-12-01 00:39:07 1.0
6 2 2021-12-01 00:00:08 2021-12-01 00:18:27 2.0
... ... ... ... ...
3212643 2 2021-12-30 23:04:37 2021-12-30 23:22:59 NaN
3212644 1 2021-12-30 23:31:58 2021-12-30 23:37:58 NaN
3212645 1 2021-12-30 23:53:54 2021-12-31 00:03:17 NaN
3212646 2 2021-12-30 23:28:00 2021-12-30 23:43:00 NaN
3212672 2 2021-12-31 00:00:00 2021-12-31 00:08:00 NaN
trip_distance RatecodeID store_and_fwd_flag PULocationID \
1 2.76 1.0 N 238
2 3.40 1.0 N 239
3 1.30 1.0 N 148
4 3.10 1.0 N 231
6 5.18 1.0 N 114
... ... ... ... ...
3212643 7.01 NaN None 42
3212644 1.30 NaN None 164
3212645 2.30 NaN None 263
3212646 4.46 NaN None 246
3212672 1.02 NaN None 163
DOLocationID payment_type fare_amount extra mta_tax tip_amount \
1 42 2 13.00 0.5 0.5 0.00
2 74 1 13.50 3.0 0.5 2.00
3 87 1 6.50 3.0 0.5 2.05
4 246 1 12.50 3.0 0.5 2.50
6 239 1 17.50 0.5 0.5 4.26
... ... ... ... ... ... ...
3212643 70 0 32.15 0.0 0.5 0.00
3212644 90 0 6.50 0.5 0.5 1.03
3212645 151 0 9.50 0.5 0.5 1.33
3212646 87 0 17.34 0.0 0.5 4.42
3212672 229 0 13.20 0.0 0.5 0.00
tolls_amount improvement_surcharge total_amount \
1 0.00 0.3 16.80
2 0.00 0.3 19.30
3 0.00 0.3 12.35
4 0.00 0.3 18.80
6 0.00 0.3 25.56
... ... ... ...
3212643 6.55 0.3 39.50
3212644 0.00 0.3 11.33
3212645 0.00 0.3 14.63
3212646 0.00 0.3 25.06
3212672 0.00 0.3 16.50
congestion_surcharge airport_fee PU_date
1 2.5 0.0 2021-12-01
2 2.5 0.0 2021-12-01
3 2.5 0.0 2021-12-01
4 2.5 0.0 2021-12-01
6 2.5 0.0 2021-12-01
... ... ... ...
3212643 NaN NaN 2021-12-30
3212644 NaN NaN 2021-12-30
3212645 NaN NaN 2021-12-30
3212646 NaN NaN 2021-12-30
3212672 NaN NaN 2021-12-31
[2827015 rows x 20 columns]
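The isin() filter keeps any trip whose pick-up zone belongs to the selected set of LocationIDs. The same two-step pattern on a toy lookup table (illustrative values, not the real zone file):

```python
import pandas as pd

# A miniature stand-in for the taxi zone lookup table.
taxi_zones = pd.DataFrame({
    "LocationID": [4, 12, 13, 138],
    "Borough": ["Manhattan", "Manhattan", "Manhattan", "Queens"],
})
trips = pd.DataFrame({"PULocationID": [4, 138, 12, 7]})

# Step 1: collect the LocationIDs belonging to the borough of interest.
manhattan_zones = taxi_zones[taxi_zones["Borough"] == "Manhattan"]["LocationID"]

# Step 2: keep only trips whose pick-up zone is in that set.
manhattan_trips = trips[trips["PULocationID"].isin(manhattan_zones)]
print(manhattan_trips)  # rows with PULocationID 4 and 12 remain
```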
Then use the groupby method to group the data by pick-up location ID and compute the count of
rides for each group:
gb_pu_location = manhattan_trips.groupby(['PULocationID'], as_index=False).agg(count=('VendorID', 'count')).to_pandas()
We can then use a library like plotly to visualize the spatial distribution of rides:
m = px.choropleth(
gb_pu_location,
geojson=geojson,
locations='PULocationID',
featureidkey='properties.location_id',
color='count',
color_continuous_scale="Viridis",
range_color=(0, gb_pu_location['count'].max()),
labels={'count':'trips count'}
)
m.update_geos(fitbounds="locations", visible=False)
m.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
m.show()